arXiv: 1505.06891vl [stat.AP] 26 May 2015 


Model-Based Geostatistics for Prevalence Mapping in 

Low-Resource Settings 


Peter J Diggle and Emanuele Giorgi 
(Lancaster Medical School, Lancaster University) 

May 27, 2015 


Abstract 

In low-resource settings, prevalence mapping relies on empirical prevalence data from 
a finite, often spatially sparse, set of surveys of communities within the region of interest, 
possibly supplemented by remotely sensed images that can act as proxies for environ¬ 
mental risk factors. A standard geostatistical model for data of this kind is a generalized 
linear mixed model with binomial error distribution, logistic link and a combination of 
explanatory variables and a Gaussian spatial stochastic process in the linear predictor. 
In this paper, we first review statistical methods and software associated with this stan¬ 
dard model, then consider several methodological extensions whose development has 
been motivated by the requirements of specific applications. These include: methods for 
combining randomised survey data with data from non-randomised, and therefore po¬ 
tentially biased, surveys; spatio-temporal extensions; spatially structured zero-inflation. 
Throughout, we illustrate the methods with disease mapping applications that have 
arisen through our involvement with a range of African public health programmes. 

Keywords: geostatistics; multiple surveys; prevalence; spatio-temporal models; zero- 
inflation. 


1 Introduction 


The term “geostatistics” is typically used as a convenient shorthand for statistical models 
and methods associated with analysing spatially discrete data relating to an unobserved spa¬ 
tially continuous phenomenon. The name derives from its origins in the South African mining 


industry (Krige, 1951) and its subsequent development by the late Georges M atheron and 


colleagues in L’Ecole des Mines, Fontainebleau, France (Chiles & Delhner, 2012). Geostatis 


tical methodology has since been applied in a wide range of scientific contexts, and is now 


widely accepted as one of three main branches of spatial statistics (Cressie, 1993). The de¬ 


scriptive phrase “model-based geostatistics” was coined by Diggle, Tawn & Moyeed (1998) 


to mean the embedding of geostatistics within the general framework of statistical modelling 
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and likelihood-based inference as applied to geostatistical problems. In contrast, “classical” 
Fontainebleau-style geostatistics has its own terminology and self-contained methodology, de¬ 
veloped largely independently of the statistical mainstream. 

Whether tackled through the model-based or classical approach, a typical feature of most 
geostatistical problems is a focus on prediction rather than on parameter estimation. The 
canonical geostatistical problem, expressed in the language of model-based geostatistics, is the 
following. Data {{iji^Xi) : i = l,...,n} are realised values of random variables Yi associated 
with pre-specihed locations Xi & A C. IR^. The are assumed to be statistically dependent 
on an unobserved stochastic process, {5'(a:) : x G IR^}, as expressed through a statistical 
model [S,Y] = [5'][F|5'], where [•] means “the distribution of,” Y = {Yi,...,Yn) and S = 
{S'(xi),..., S'(a:^)}. What can be said about the realisation of S'? The formal model-based 
solution is the conditional distribution, [S|y], which follows as a direct application of Bayes’ 
theorem, 

[S\Y] = [S][F|S]/ J [S][y|S]dS. 

By far the most tractable case is the linear Gaussian model, for which S is a Gaussian process 
and the Y^ given S are conditionally independent, I^IS ~ N(S(xj),r^). It follows that both 
the marginal distribution of Y and the conditional distribution of S given Y are multivariate 
Normal. 


Note that in the above formulation, no model is specihed for the Xi. We return to this point in 
the discussion, but in the meantime the implicit assumption is that the Xj are pre-specihed as 
part of the study-design or are located according to a process that is stochastically independent 
of 5”. If X = (xi,..., Xn) is stochastic, a complete factorisation is [S, X,Y] = [5'][X|S'][F|X, 5”]. 
Then, if [X|S'] = [X] and the properties of [X] are not of interest, it is legitimate to condition 
on [X] and so recover the previous formulation, [S',!"] = [S][y|S]. 


Diggle, Moraga, Rowlingson & Taylor (2013) argue that the geostatistical label should be 


applied more generally to scientihc problems that involve predictive inference about an un¬ 
observed spatial phenomenon S(x) using any form of incomplete information. This includes, 
for example, predictive inference for the intensity of a Gox process (Gox, 1955), and inference 
when X is both stochastic and dependent on S. 


In this paper, we restrict our substantive scope to the problem of analysing data from spatially 
referenced prevalence surveys. We also focus on prevalence mapping in low-resource countries 
where registry data are lacking. We argue that in low-resource settings the sparsity of the 
available data justihes a more strongly model-based approach than would be appropriate if 
accurate registries were available. 


2 The standard geostatistical model for prevalence data 

In its most basic form, a prevalence survey consists of visiting communities at locations x* : 
i = l,...,n distributed over a region of interest A and, in each community, sampling m* 
individuals and recording whether each tests positive or negative for the disease of interest. 
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If p{x) denotes prevalence at location x, the standard sampling model for the resulting data 
is binomial, Yi ~ Bin(mj,p(xi)) for i = Linkage of the p{xi) at different locations 

is usually desirable, and is essential if we wish to make inferences about p{x) at unsampled 
locations x. 

The simplest extension to the basic model is a binary regression model, for example a logistic 
regression model of the form 


\og[p{xi)/{l - p{xi)}] = d{xi)'l3, (1) 

where d{xi) is a vector of explanatory variables associated with the location Xj. This assumes 
that the value of d{x) is available not only at the data-locations Xi but also at any other 
location x that is of interest. When extra-binomial variation is present, two further extensions 
are possible. Firstly, a standard mixed effects model adds a random effect to the right-hand- 
side of 0 . to give 

\og[p{xi)/{l -p{xi)}] = d{xiy/3 + Zi, 

where the Z* are independent, N(0,r^) variates. Secondly, if the context suggests that 
covariate-adjusted prevalence should vary smoothly over the region of interest, we can add a 
spatially correlated random effect, to give 


log[p{xi)/{l -p{xi)}] = d{xi)'l3 Y S{xi) + Zi, 


( 2 ) 


where S = {S{x) : x E IR^} is a Gaussian process with mean zero, variance cr^ and correlation 
function p{x,x') = Corr{S'(a:), S'(x')}. We shall assume that the process S is stationary 
and isotropic, hence Corr{S'(x), S'(x')} = p(\\x — x'\\), where || • || denotes the Euclidean 
distance. The initial focus of inference within this model is the unobserved surface p{x) or 
specific properties thereof. In general, we call T = T{S) a target for predictive inference. For 
example, we may wish to delineate sub-regions of A where p{x) is likely to exceed a policy 
intervention threshold, in which case the target is T = {x : p(x) > c] for pre-specified c, and 
the required output from the analysis is the predictive distribution of the random set T. 


Equation Q defines what we shall call the standard geo statistical prevalence sampling model. 
Various approaches to fitting this model to geostatistical data have been suggested in the 
literature. Diggle, Tawn & Moyeed (1998) used Bayesian inference for parameter estimation 
and prediction, implemented by an MCMC algorithm. Rue, Martino & Chopin (2009) used 
integrated nested Laplace approximation (INLA) methods. The INLA methodology and its 
associated software yield accurate and computationally fast approximations to the marginal 
posterior distributions of model parameters and to the marginal predictive distributions of 
S{x) at any set of locations x, but not to their joint predictive distribution; this limits INLA’s 
applicability to point-wise targets T. Giorgi & Diggle (2014) provide an R package for Monte 
Carlo maximum likelihood estimation and plug-in prediction with an option to use a low- 
rank approximation to S for faster computation with large data-sets. The low-rank method 
approximates S by <S*, where 


'^*(^) = ^k)Vk- 

k = l 


(3) 
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In the Vk are independent N(0, r^) variates associated with a pre-specified set of locations 
Xk and /(x) is a prescribed function, typically monotone non-increasing in | |x| |. The covariance 
function of 5* is 

r 

Coy{S*{x), X(a:')} = '''^ X] ~ ^k)f{x' - Xk), (4) 

k=l 


Low-rank specifications have been proposed as models in their right; see, for example, [Higdon 
(1998, 2002). We consider them as approximations to a limiting, full-rank process. Taking 
the Xk in (|^ as the points of an increasingly fine regular lattice and scaling the function 
/(•) commensurate with the lattice spacing gives a limiting, full-rank process with covariance 
function 


Cov{<S'*(a:), / ^f{x — u)f{x' — u)du. 

9]R 


(5) 


From this perspective, the summation in (|^ represents a quadrature approximation to the 
integral in ([^. Note, however, that this construction admits only a sub-class of the allowable 
covariance functions for a spatially continuous Gaussian process. 


Gotway & Stroup 

(1997 

) suggest using generalized estimating equations ( 

Liang & Zeger, 

1986) 


when scientific interest is focused on the regression parameters rather than on prediction of S. 
However, in this approach the implicit target for inference is not the parameter vector (3 that 
appears in ([^, but rather the marginal regression parameter vector, X say. The elements of 
/3* are smaller in absolute value than those of (3 by an amount that depends on r^, cr^ and 
p{u). 


Higgle et al. (2007) use the standard model, but without the mutually independent random 


effects Zi, to construct predictive maps of the prevalence of Loa loa, a parasitic infection of 
the eye, in an area of equatorial west Africa covering Cameroon and parts of its neighbouring 
countries. Following Thomson et al. (2004) they include two remotely sensed covariates, 
height above sea-level and the Normalised Digital Vegetation Index (NDVI), as proxies for 
the ability of the disease vector, a particular species of Chrysops fly, to breed at each location. 
As described in Thomson et al. (2004) and Diggle et al. (2007), Loa loa prevalence mapping 
plays an important role in the implementation of a multi-national prophylactic mass-treatment 
programme for the control of onchocerciasis (river blindness), the African Programme for 
Onchocerciasis Control, APOC (WHO, 2012), following the recognition that a generally safe 
filaricide medication. Ivermectin, could produce severe, occasionally fatal, adverse reactions 
in people heavily co-infected with onchocerciasis and Loa loa parasites. As a result, APOC 
adopted the policy that in areas where Loa loa prevalence was greater than 20%, precautionary 
measures should be taken before local administration of Ivermectin. 


mapped the minimum mean square error point predictor, p{x) = E[p(x)|V] 
but also argued that a more useful quantity was the point-wise predictive probability, q{x) say, 
that p{x) exceeded 0.2, in line with APOC’s precautionary policy In addition to addressing 
directly the relevant practical problem, a map of q{x) conveys the uncertainty associated with 
the resulting predictions. This map, here reproduced as Figure [!} identifies large areas that 
almost certainly do and do not meet the policy-intervention criterion, but also delineates 
large areas where the only honest answer is “don’t know,” indicating the need for further 
investigation or, if practicalities dictate, taking an informed risk. 


Diggle et al. (2007 
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Figure 1: Predictive probability map of Loa loa prevalence in Cameroon and surrounding 
areas (adapted from Diggle et al. (2007)). Empirical prevalences at surveyed locations are 


indicated by size and colour coded dots. 


Other prevalence mapping applications of model-based geostatistics include: Claridge et al. 


(2012) on liver fluke and bovine tuberculosis in the UK cattle herd; Clements et al. (2006) 


on schistosomiasis in Tanzania; Diggle et al. (2002) on childhood malaria in the Gambia; 


Gemperli et al. (2004) on infant mortality in Mali; Gething et al. (2012) on the world-wide 
distribution of Plasmodium vivax; Hay et al. (2009) on the world-wide distribution of Plasmod¬ 


ium falciparium] Kleinschmidt et al. (2001) on malaria incidence in Kwazuku Natal, South 


Africa; Kleinschmidt et al. (2007) on HIV in South Africa; Soares Magalhaes & Clements 


(2011) on anemia in preschool-aged children in West Africa; Raso et al. (2005) on schistoso¬ 


miasis in Cote D’Ivoire; Pullan et al. (2011) on soil-transmitted infections in Kenya; Zoure 


et al. (2014) on river blindness in the 20 participating countries of the African Programme for 


Onchocerciasis control. 


3 Combining information from multiple surveys 

In order to obtain good geographical coverage of the population of interest, it is often necessary 
to combine information from multiple prevalence surveys. However, understanding the limita¬ 
tions of the sampling design adopted in each survey is crucial in order to draw valid inferences 
from a joint analysis of the data. In particular, non-randomized “convenience” surveys in 
which data are gathered opportunistically, for example at schools, markets or hospital clinics, 
may reach an unrepresentative sub-population or be biased in other ways. Nonetheless, con¬ 
venience samples represent a tempting, low-cost alternative to random samples. A combined 
analysis of data from randomised and convenience samples that estimates and adjusts for bias 
can be more efficient than an analysis that considers only the data from randomised surveys. 


5 







































































In a non-spatial context, Hedt & Pagano (2011) propose a hybrid estimator of prevalence that 


supplements information from random samples with convenience samples, and show that this 
leads to more accurate prevalence estimates than those available from using only the data 
from randomised surveys. 


Giorgi et ah (2015) develop a multivariate generalized linear geostatistical model to account 


for data-quality variation amongst spatially referenced prevalence surveys. They assume that 
at least one of the available surveys is a “gold-standard” that delivers unbiased prevalence 
estimates and for which the standard model ([^ is appropriate. Bias in a “non gold-standard” 
survey is then modelled using covariate information together with an additional, zero-mean 
stationary Gaussian process B — {B{x) : x G K^}. The resulting model for a non-randomised 
survey is 

\og\p{xi)/{l -p{xi)}] = d{xiyi3 + S{xi) + Zi + {d{xiyS + B{xi)}. (6) 


Data from both the randomised and the non-randomised survey then contribute to inference 
on the predictive target, d^x)'(3 + S{x). 


3.1 Application: using school and community surveys to estimate 
malaria prevalence in Nyanza Province, Kenya 


We now show an application to malaria prevalence data from a community survey and a 
school survey conducted in July 2010 in Rachuonyo South and Kisii Central Districts, Nyanza 
Province, Kenya. In the community survey, all residents above the age of 6 months were 
eligible for inclusion. A finger-prick blood sample was collected on each participant and 
examined for presence/absence of malaria parasites by a rapid diagnostic test (RDT). 


In the school survey, 46 out of 122 schools with at least 100 pupils were randomly selected using 
an iterative process to limit the probability of selecting school with overlapping catchment 
areas. All eligible children in attendance were included. In the community survey, residential 
compounds lying within 600 meters of each school were randomly sampled and all eligible 
residents in each sampled compound examined by the RDT. The design of the community 
survey delivers an unbiased sample of residents from the catchment area of each school, whereas 
the school survey is potentially biased by a plausible association between a child’s health 
status and their attendance at school. More details on the survey procedures can be found in 


Stevenson et ah (2013). 


In our analysis, we extracted information on sampled individuals between the ages of 6 and 25 
years in both surveys, as some adults have taken advantage of the introduction of free primary 
education in Kenya. The community survey included 1430 individuals distributed over 740 
compounds whilst the school survey included 4852 pupils distributed over 3791 compounds, 
i.e. averages per compund of approximately 1.9 and 1.3 people, respectively. Figure shows 
the locations of the sampled compounds from both surveys. 


For our joint analysis of the data from both surveys, we used exponential correlation func¬ 
tions for both S{x) and B(x), with (p and xp denoting the respective scale parameters. We 
parameterise the respective variances of S{x), B{x) and Z* as cr^, and cu^cr^. 
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Table 1: Explanatory variables used in the analysis of the Kenya malaria prevalence data. 


Term 

/^o Intercept 
Age in years 

(32 District (=1 if “Rachuonyo”; =0 otherwise) 

Socio-economic status (score from 1 to 5) 
ho Survey indicator, 1 if “school,” 0 if “community” (bias term) 
hi Age in years (bias term) 


For selection of significant explanatory variables we used ordinary logistic regression, retaining 
variables with nominal p-values smaller than 5%. Table gives the final set of explanatory 
variables included in the geostatistical model. The “District” indicator variable accounts for 
a known higher level of malaria risk in Rachuonoyo district. Socio-economic status (SES) is 
an indicator of household wealth taking discrete values from 1 (poor) to 5 (wealthy). 

Table [2] reports Monte Carlo maximum likelihood estimates and 95% confidence intervals 
for the model parameters. The /3-parameters reflect the district effect mentioned above as 
well as confirming a lower risk of malaria associated with higher scores of SES and greater 
age. The negative estimate of ho and its associated confidence interval indicate a significantly 
lower malaria prevalence in individuals attending school than in the community at large. 
The positive estimate and associated confidence interval for hi indicate that for individuals 
attending school, the negative effect of age is less strong than in the community. Figure 
[^a) shows point-wise predictions of B*{x) — exp{i3(a:)}, which represents the unexplained 
multiplicative spatial bias in the school survey for the odds of malaria at location x. Figure 
[^b) maps the predictive probability, r{x) say, that B*{x) lies outside the interval (0.9,1.1), 

r{x) = 1 — P (0.9 < B*{x) < 1.1|?/). (7) 

The lowest value of r{x) is about 87%, indicating the presence of non-negligible spatially 
structured bias throughout the study area. The joint analysis of the data from both surveys 
allows us to remove the bias and so obtain more accurate predictions for S{x) than would be 
obtained using only the data from the community survey. Figure |^a) shows a scatter plot 
of the standard errors for ^(a:) obtained from the joint model for the school and community 
surveys and from the model fitted to the community data only. Figure |^b) shows that 
locations for which the joint analysis produces a larger standard errors for S{x) correspond 
to areas where no observations were made. 


4 Analysing spatio-temporally referenced prevalence sur¬ 
veys 

In endemic disease settings where prevalence varies smoothly over time, joint analysis of data 
from surveys collected at different times can also bring gains in efficiency. The modelling 
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Figure 2: Geographical coordinates of the sampled compounds in the community (black 
points) and school (red points) surveys. 
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Figure 3: The predicted surfaces for B*{x) (a) and r{x) (b). 
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Table 2: Monte Carlo maximum likelihood estimates and corresponding 95% confidence in¬ 
tervals for the model fitted to the Kenya malaria prevalence data 


Estimate 95% Confidence interval 
Jo AAl2 

J -0.141 


P 2 

2.006 


-0.121 

^0 

-0.761 


0.094 

log(cr^) 

0.519 

log(i/2) 

-1.264 

log((/>) 

-3.574 

log(cc;2) 

-1.408 

log(V’) 

-3.366 


(-2.303, -0.521) 
(-0.174, -0.109) 
(1.228, 2.785) 
(-0.169, -0.072) 
(-1.354, -0.167) 
(0.046, 0.142) 
(0.048, 0.990) 
(-1.738, -0.790) 
(-4.083, -3.064) 
(-2.267, -0.550) 
(-4.178, -2.553) 


(a) 


(b) 


CD 

2 

CO 




Figure 4: (a) Scatterplot of the standard errors for S(x) using data from the community 
survey only (x-axis) and using both community and school survey data. Points coloured 
green or red he below or above the identity line y = x, respectively, (b) Prediction locations, 
coloured green or red at locations where the prediction variance for S{x) is smaller or larger, 
respectively, when using the data from both the community and school surveys. 
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framework in Giorgi et al. (2015) accommodates multiple surveys conducted at different, 
discrete times. The extension of (|^ to m surveys conducted at possibly different times is 


log\pk{xi)/{l-pk{xi)}] = dixikY (3 + Sk{xi) + Zik + 

I{k e B){d{xikyS + Bk{xi)}, k = l,...,m 


( 8 ) 


where B denotes the indices of the non-randomised surveys, Cov{S'fc(a:), = a‘^akk'P{x, x') 

and akk' = 1 if surveys k and k' are taken at the same time, — 1 < akk' < 1 otherwise. 

A different design for monitoring endemic disease prevalence is the rolling indicator survey 


(Roca-Feltrer et al., 2012). This consists of sampling members of a target population of 


individuals or households more or less continuously over time, the order of sampling being 
randomised. A natural model for the resulting data is a spatio-temporal version of ([^, 


log[p{xi,ti)/{l -p{xi,ti)}] = d{xi,ti)'l3 + S{xi,ti) + Zi, 


(9) 


where now (a:*, t*) denotes the location and time of the Ah sample member. There is an exten¬ 
sive literature on ways of specifying the covariance structure of a spatio-temporal Gaussian 
process; see, for example, Gneiting & Guttorp (2010). For endemic diseases, a reasonable 
working assumption is that the relative risk of disease at different times is the same at all 
locations, and vice versa. This implies an additive formulation, 

S(x,t) = S{x) + U{t), (10) 

where S{x) and U{t) are independent spatial and temporal Gaussian processes, respectively. 


4.1 Application: rolling malaria indicator survey in Chikwawa dis¬ 
trict, Malawi, May 2010 to June 2013 

We now analyse data from a rolling malaria indicator survey (rMIS) conducted in Ghikwawa 
District, Southern Malawi, from May 2010 to June 2013. In this rMIS, children under five 
years were randomly selected in 50 villages covering an area of approximately 400 km^. Blood 
samples were then collected and tested by RDT for malaria. The objectives of the analysis 
are the following. 

(i) interpolation of the spatio-temporal pattern of malaria prevalence for children under 
twelve months; 

(i) estimation of the reduction in prevalence and number of infected children through a 
scale-up in the distribution of insecticide treated nets (ITN) and delivery of indoor 
residual spraying (IRS), from the actual coverage to 100% coverage in each village. 

A practical distinction between these two objectives is that the first can only use explanatory 
variables that are available throughout the study-region, whereas the second can additionally 
use explanatory variables associated with the sampled households. 
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4.1.1 Spatio-temporal interpolation of malaria prevalence 

Let pj{xi,ti) denote the probability of having a positive RDT outcome for the j-th children 
in the z-th household at time t*. Using the model defined by ([^, the linear predictor assumes 
the form 


\og[pj{xi, - Pj{xi, ti)}] = Pq + I3idij + + Ps sin(27rU/12) + P 4 cos(27rtj/12) + 

/35sin(27rti/6) + /^e cos(27rtj/6) + S{xi) + U{ti), (11) 

where dij is a binary indicator that takes value 1 if the child is under twelve months and 0 
otherwise. The linear combination of sine and cosine functions with periodicities of one year 
and six months is used to model the seasonality of malaria. For both S{x) and U{t), we use 
isotropic exponential correlation functions with scale parameters (p and respectively. We 
use cr^ and to denote the variance of S{x) and U{t), respectively. 

Table (Model 1) reports the MCML estimates of the model parameters; for the positive¬ 
valued parameters cp, and ip we applied a log-transformation to improve the quadratic 
approximation to the log-likelihood. As expected, the estimate of jdi indicates a significantly 
lower risk of having a positive RDT outcome for children in the first year of life, as newborns 
benefit from maternally acquired immunity that gradually fades. 


Table 3: Monte Carlo Maximum Likelihood estimates for spatio-temporal models fitted to the 
Malawi malaria prevalence data. Model 1 is defined at equation(|TT|). Model 2 includes three 
additional explanatory variables: ITN, IRS and SES (Model 2). 


Model 1 

Model 2 

Term 

Estimate 95% Confidence interval 

Estimate 95% Confidence interval 

/do 

4.210 

(3.815, 

4.605) 

4.644 

(4.099, 5.189) 

A 

-5.380 

(-5.914, 

-4.847) 

-5.428 

(-6.066, -4.789) 

P2 

-0.067 

(-0.083, 

-0.051) 

-0.072 

(-0.090, -0.054) 

/33 

-0.749 

(-0.978, 

-0.521) 

-0.693 

(-0.922, -0.465) 


0.361 

(0.134, 

0.588) 

0.160 

(-0.070, 0.389) 


-0.099 

(-0.307, 

0.109) 

-0.260 

(-0.475, -0.045) 

Pe 

-0.168 

(-0.391, 

0.055) 

-0.062 

(-0.286, 0.162) 


- 

- 


-0.188 

(-0.492, 0.117) 


- 

- 


-0.181 

(-0.503, 0.141) 

^9 *** 

- 

- 


-0.079 

(-0.505, 0.347) 

log(cr2) 

0.899 

(-0.011, 

1.808) 

0.971 

(0.035, 1.906) 

log{(p) 

-3.624 

(-4.852, 

-2.397) 

-4.463 

(-5.769, -3.157) 

log{u^) 

-3.282 

(-4.199, 

-2.365) 

-3.118 

(-4.059, -2.177) 

log{ip) 

0.882 

(-0.170, 

1.934) 

1.118 

(0.017, 2.218) 


* ownership of at least one ITN; ** presence of IRS; *** SES (score from 1 to 5). 


We now generate prevalence predictions for five of the 50 villages in Chikwawa District. We 
chose these five villages selectively to include areas of low and high risk for malaria. Let Aj 
denote the convex hull obtained from the sampled locations of the Uth village. For a fixed 
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Figure 5: Estimated temporal trend of RDT prevalence for five villages in Chikwawa District. 
Figure!^ shows the location of each of these five villages. 


time t, we computed 


Pi{t) = \Ai\ ^ 


p{x, t) dx, for z = 1,..., 5 


( 12 ) 


where dij is fixed at 1 for all x E Ai and t = 1,2,... ,38, where each integer identifies a 
month, from May 2010 to June 2013. Also, p(x, t) is the mean of the predictive distribution 
of prevalence at location x and month t. For each village, i, we approximated the intractable 
integral in (12) using a quadrature method based on a regular grid covering the corresponding 
Ai. The results are shown in Figurewhere each pi{t) is plotted against t; a declining trend 
of RDT prevalence can be seen, with seasonal troughs and peaks around December-January 
and April-May, respectively. 


For any specified policy-relevant prevalence threshold p, a quantity of interest is the predictive 
probability that the estimated prevalence p(x, t) exceeds p. In Figure]^ we map the exceedance 
probabilities in June of each year for p — 0.2. Two areas of high and low prevalence are clearly 
identified. The former corresponds approximately to a flooding area where the the presence 
of local ponds also favours mosquito breeding. 


4.1.2 Estimating the impact of scaling-up control interventions 

The model ( [TT| that we used to predict malaria prevalence throughout the study-region nec¬ 
essarily excluded any covariate that was only available at the sampled locations. We now 
propose a procedure to estimate community-wide prevalence and number of infected children 
under a pre-defined control scenario, focusing on the effects of ownership of ITN and presence 
of IRS, and adjusting for a measure of each household’s socio-economic status (SES, scored 
from 1 to 5). We first fit a model with linear predictor of the same form in (0. but including 
these three additional explanatory variables. The resulting parameter estimates are shown in 


13 
























June 2010 June 2011 



34.72 34.76 34.80 34.84 34.72 34.76 34.80 34.84 


June 2012 June 2013 



34.72 34.76 34.80 34.84 34.72 34.76 34.80 34.84 


Figure 6: Maps of the predictive exceedance probabilities for a 20% malaria prevalence thresh¬ 
old in Chikwawa district; light blue lines correspond to waterways, with the Shire river repre¬ 
sented by the thicker line. 
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Table 1^ (Model 2). We then use enumeration data to obtain, for each village, the total number 
of children under five years and the number of households with at least one child under five 
years, and proceed as follows. 

(i) Allocate the number of children in each household. 

(ii) Impute geographical coordinates, ownership of ITN, presence of IRS and remaining 
explanatory variables for all unsampled children under the pre-defined control scenario. 

(hi) Generate values for all the model parameters using the asymptotic distribution of the 
maximum likelihood estimator, i.e. 

O~N(0, 

where 6 is the vector of model parameters and Jobs is the observed Fisher information 
as estimated by the negative Hessian of the Monte Carlo likelihood. 

(iv) Generate predictive samples for each child’s infection status and compute the mean of 
each sample as a point-estimate of the probability of infection for that child. 

(v) For each village, estimate of the number of infected children as the sum of the estimated 
child-specific probabilities of infection, and average these to estimate the village-level 
prevalence. 

We then repeat this process N times and, for each village, compute summary statistics of the 
N samples of estimated numbers of infected children and village-level prevalence. We applied 
this procedure under two different scenarios for April 2013, the most recent peak in RDT 
prevalence within the period covered by the data, as follows. 


51. Households having IRS and at least one ITN are equally distributed among sampled 
and unsampled households. 

52. Every household, whether sampled or unsampled, has IRS and at least one ITN. 

In Step (A), we imputed the locations of unsampled children by independent random sampling 
from the uniform distribution over each village area A*, defined as the convex hull of the 
sampled households’ locations. 

In scenario SI, we imputed age, ITN, IRS and SES by random sampling from the empirical 
villlage-level distribution of the sampled households. In scenario S2, only SES and age need 
to be imputed as ITN and IRS are both present in every household. The differences between 
estimated prevalences and between numbers of infected children under S2 and SI are reported 
in Eigure The main gains achieved by scenario S2 are in villages situated in the high 
prevalence area to the east of the Shire river. 
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Figure 7: Estimated reduction in prevalence (left panel) and number of infected children (right 
panel) for each of the 50 villages in Chikwawa District, assuming a scale-up in the distribution 
of ITN and IRS to 100% coverage. 

5 Spatially structured zero-inflation 


The standard geostatistical model for prevalence data in ([^ assumes binomial sampling vari¬ 
ation around the true prevalence, with a latent risk surface that approaches, but does not 
exactly reach, zero. However, empirical prevalence data often show an excess of zeros, i.e. 
zero-inflation. For diseases that are environmentally driven, one explanation for this is that 
some areas are fundamentally unsuitable for disease transmission. Hence, a zero prevalence 
estimate in a particular community can be either a chance flnding, or a necessary consequence 
of the community being disease/infection-free. Ways of handling spatially structured zero- 


inflation have been proposed in ecology (Agarwal et al. 

, 2002 

) and in specific epidemiological 

applications ( 

Amek et al., 

2011; 

Giardina et al., 

2012) 

. These approaches assume that the 


zero-inflation can be explained by regressing on a limited set of measured risk factors. In this 
extension to the standard geostatistical model ([^ for spatially varying prevalence, p{x), the 
distribution for the prevalence data Y conditional on S now takes the form of a mixture, 


P(y = y\S{xi)) 


[1 - 7r(xj)] -F 7r(xi)Bin(0; rrii, p{xi)) if y = 0 
7r(xj)Bin(?/; m,p{xi)) if y > 0 


(13) 


where 7 r(xi) G (0,1) denotes the probability that Xi is suitable for transmission of the disease, 
log[7r(xj)/{l — 'K^Xi)}] = d^XiY'j and Bm{y;m,p) denotes the probability mass function of a 
binomial distribution with probability of success p and number of trials m. The modelled 
prevalence at location x is p*{x) = tt{x)p{x). 
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An alternative way of specifying the conditional distribution of V given S is given by the so 
called “hurdle” model (Mnllahy, 1986). In this case the mixture distribution for V assumes 
the form 

'l-7r(xi) if p = 0 

P(Yi = y\S(xi)) = <J 7r(xi)Bin(y;m,p(xi)) ^ • (14) 


1 — Bin(0; m,p{xi)) 


if y > 0 


In our view, (14) is unsuitable for diseases mapping for the two following reasons. Firstly, 
the model does not distinguish between observing no cases amongst sampled individuals as 
a chance hnding or as a necessary consequence of the entire community being disease-free. 
Secondly, the model can generate unnatural patches of low prevalence around each sampled 
location for which no cases are observed amongst sampled individuals. 


A natural extension of the models in (13) and (14) that allows zero-inflation to depend on both 
measured and unmeasured covariates can obtained as follows. Define an additional stationary 
Guassian process T{x) such that 


log[7r(xi)/{l - T^{xi)}] = d{xi)'-i + T{xi). 

The spatial processes S{x) and T{x) can also be further decomposed as 


(15) 


S{x) = Ui{x)YV{x), 
T{x) = U 2 ix) + V{x) 


where Ui{x), U 2 {x) and V{x) are independent Gaussian proccesses. In this formulation, V{x) 
accounts for unmeasured factors that jointly affect the risk of the disease at a location x that 
is suitable for transmissionof the disease and the risk that x is itself suitable for transmission. 
However, identification of all of the resulting parameters requires a large amount of data. 
A pragmatic response is to assume that V{x) = f) for all x, i.e. that S{x) and T{x) are 
independent processes. 


5.1 Application: river-blindness prevalence mapping 


We now show an application to river-blindness prevalence data, previously analysed in Zoure 


et ah (2014). Here, we restrict our analysis to three of the twenty APOG countries, namely 


Mozambique, Malawi and Tanzania. Figure shows the locations of the sampled villages in 
the three countries. Red dots identify the 513 villages with no cases of river-blindness amongst 
sampled individuals, black dots the 397 villages with at least one case. 


We fit the model with conditional distribution for Y given by (13), and logistic link functions 
([^ and (15) for p{x) and 7r(x), respectively. We also assume that S{x) and T[x) are indepen¬ 


dent processes with ciovariance functions exp(—u/0) and a‘^uP‘ exp(—u/?/^), respectively; we 
denote the variance of the nugget effect Z by cr^zz^. We do include covariates, but simply fit 
constant means /ii and /i 2 on the logit-scale of p{x) and 7r(a:), respectively. 


Table shows the MGML estimates of the model parameters. The estimated scale of the 
spatial correlation of T{x) is much larger than that of S{x). Also, the estimate of the noise- 
to-signal ratio is substantial. 
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Figure 8: Sampled villages in Mozambique, Malawi and Tanzania, with balck and red dots cor¬ 
responding to villages with no observed case and at least one observed case of river-blindness, 
respectively. 


Table 4: MCML estimates of parametetrs in the zero-inflated geostatistical model and asso¬ 
ciated 95% confidence intervals. _ 

Term Estimate 95% confidence interval 


hi 

-5.812 

(-8.746, 

-2.877) 

h2 

2.287 

(1.361, 

3.213) 

log(cr^) 

-3.138 

(-4.075, 

-2.200) 

log ( 1 / 2 ) 

1.579 

(0.615, 

2.543) 

log{(j)) 

-2.899 

(-6.162, 

0.363) 

log(tu2) 

2.390 

(1.425, 

3.354) 

logii)) 

1.679 

(0.704, 

2.654) 


18 





Latitude 


(a) 



(b) 



28 30 32 34 36 38 40 42 


Longitude 



Figure 9: (a) Difference between predicted prevalences using the standard and zero-inflated 
geostatistical models, (b) Predicted surface ■n{x). 
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Figure]^ a) shows the difference between estimates of prevalence Ps{x) and Pz{x) based on the 
standard and zero-inflated, geostatistical models, respectively; these range between plus and 
minus 0.2. Figure [^b) shows the estimated surface of and indicates that the central 

and northern parts of Malawi are disease-free, whereas most of the reported zero cases in 
Mozambique and Tanzania are more likely to be attributable to binomial sampling error. 


6 Discussion 


We have discussed four important issues that arise in prevalence mapping of tropical diseases, 
namely: combining data from multiple surveys of different quality; spatio-temporal interpola¬ 
tion of disease prevalence; assessment of the impact of control interventions; and accounting 
for zero-inflation in empirical prevalences. For each issue we have presented an extension of 
the standard geostatistical model and have described an application that we have encountered 
through our involvement with public health programmes in Africa. 

In each application, we have used the MCML method for parameter estimation. This fitting 
procedure can be used under a very general modelling framework. Let IF* for z = 1,..., n 
denote a set of random effects associated with Yi, following a joint multivariate Normal dis¬ 
tribution with mean p and covariance matrix E. Assume that Yi conditionally on Wj are 
mutually independent random variables with distributions f{-\Wi). The likelihood function 
for the vector of model parameters 9 is given by 


m = [ g{W,y,e)dW 

jRdim(W) 

P n 

N{W;p,Y)l[f{yi\Wi)dW, 


/Mdim(W) 


where dim(VF) denotes the dimension of W. Note, for example, that in the model used 
in Section im the random effect associated with village i is a bivariate random variable, 
Wi — -|- Zi,T{xi)}, hence dim(IF) = 2 n with f{-\Wi) given by (13). Monte Carlo 


methods are then used in order to approximate the above intractable integral using importance 
sampling. As discussed in [Giorgi fc Diggle (2014), a convenient choice for the importance 
sampling distribution is g{W, y, 6 q) for some fixed 9q, which can be iteratively updated. With 
this choice, a Markov chain Monte Carlo (MCMC) algorithm is then required for simulation 
of Wi conditionally on yi under 9q. We used a Langin-Hastings algorithm that updates the 
transformed vector of random effects — IF), where IF and S are the mode and the 

inverse of the negative Hessian at IF of p(IF, r/;0o)- In each of the applications, diagnostic 
plots based on the resulting samples of IF* showed fast convergence of the MCMC algorithm; 
details are available from the authors. 


In the applications of Section 3T and Section 4T, we considered extra-binomial variation at 
household-level but not at individual-level within households. An extension of the standard 
geostatistical model ([^ that accounts for within-household random variation is 

log{pq/(l - Pij)} = a + [c'ijS + Uij] + [d{xiy/3 + S{xi) + Zi], 
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where i denotes household, j denotes individual within household, Qj is a vector of individual- 
specihc explanatory variables with associated regression parameters S and the Vij are mutually 
independent, zero-mean. Normally distributed random effects. However, when the data con¬ 
sists of empirical prevalences with small denominators, it is generally difficult to disentangle 
the effects of Z* and Ujj. For this reason we used the more pragmatic approach of setting 
Uij = 0 for all i and j. 


The results of Section |4.1.2| on the impact of scaling-up the distribution of ITN and IRS to 
a 100% coverage should be interpreted cuatiously. The procedure that we used to obtain 
estimates of prevalence and number of infected children under different scenarios does not 
deal with the issue of causation. The control scenarios SI and S2 represent virtual scenarios 
under which coverage of ITN and IRS is assumed to follow a pre-defined pattern without 
having any impact on other risk factors for malaria. In reality, a scale-up of ITN and IRS 
coverage may influence other features of the process, for example the extent to which ITNs 
are used correctly. 


Under model (13) that accounts for zero-inflation, the risk surface can approach, but not reach, 
zero. We are are currently working on two further extensions of the standard geostatistical 
model. In the first of these, prevalence can reach zero but is constrained to do so smoothly. 
The second allows discontinuities in risk between suitable and unsuitable areas of transmission. 
Spatial discontinuities may seem artificial but can give a better fit to the data, especially 
when the pattern of risk is highly non-linear. Statistical tools for automatic choice between 
non-nested models are available from both frequentist and Bayesian perspectives, but our 
preference would be to reach agreement with a subject-matter expert on what qualitative 
features of the model best reflect the behaviour of the underlying process. 
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